5420 Anomaly Detection | Assignment 4 - Joyce Ng (jn2901)¶
The purpose of conducting anomaly analysis on healthcare/inpatient data is to identify and understand unusual patterns or outliers that could indicate potential issues or opportunities for improvement in healthcare delivery, cost management, and patient outcomes. The objective of this project is to perform exhaustive EDA on the inpatient data and provide business insights for each feature, then utilize KNN and PCA from PyOD to identify anomalies.¶
Description of dataset:
DRG Definition:
Diagnosis Related Group. Classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.
Total Discharges:
The number of discharges billed by all providers for inpatient hospital services.
Average Covered Charges:
= Total Covered Charge Amount / Total Discharges. The average charge for all providers' services covered by Medicare for discharges in the DRG. These vary from hospital to hospital because of differences in hospital charge structures.
Average Total Payment:
= Total Payments / Total Discharges. The average of total payments: what Medicare actually pays to the provider, plus the co-payment and deductible amounts the beneficiary is responsible for and any payments by third parties for coordination of benefits.
Average Medicare Payment:
= Medicare Payment Amount / Total Discharges. The average amount that Medicare pays to the provider. Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.
Table of Contents¶
- Section 1: Data Preparation
- Section 2: EDA
- Section 3: Feature Engineering
- Section 4: K-Nearest Neighbor (KNN)
- Section 5: Principal Component Analysis (PCA)
- Section 6: Model Comparison
Section 1: Data Preparation ¶
1.1 Load Libraries and Dataset ¶
# Load libraries
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from sklearn import metrics
from pyod.models.knn import KNN
from pyod.models.pca import PCA
# Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/4 - Healthcare Fraud/inpatientCharges.csv')
data.head()
| DRG Definition | Provider Id | Provider Name | Provider Street Address | Provider City | Provider State | Provider Zip Code | Hospital Referral Region Description | Total Discharges | Average Covered Charges | Average Total Payments | Average Medicare Payments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | $32963.07 | $5777.24 | $4763.73 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | $15131.85 | $5787.57 | $4976.71 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | $37560.37 | $5434.95 | $4453.79 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | $13998.28 | $5417.56 | $4129.16 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | $31633.27 | $5658.33 | $4851.44 |
print(data.shape, "\n") # Check the shape of df
print(data.columns, "\n") # Check column names
print(data.info(), "\n") # Check info of the df
(163065, 12)
Index(['DRG Definition', 'Provider Id', 'Provider Name',
'Provider Street Address', 'Provider City', 'Provider State',
'Provider Zip Code', 'Hospital Referral Region Description',
' Total Discharges ', ' Average Covered Charges ',
' Average Total Payments ', 'Average Medicare Payments'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DRG Definition 163065 non-null object
1 Provider Id 163065 non-null int64
2 Provider Name 163065 non-null object
3 Provider Street Address 163065 non-null object
4 Provider City 163065 non-null object
5 Provider State 163065 non-null object
6 Provider Zip Code 163065 non-null int64
7 Hospital Referral Region Description 163065 non-null object
8 Total Discharges 163065 non-null int64
9 Average Covered Charges 163065 non-null object
10 Average Total Payments 163065 non-null object
11 Average Medicare Payments 163065 non-null object
dtypes: int64(3), object(9)
memory usage: 14.9+ MB
None
1.2 Check Missing Values & Change Column Names and Data Types ¶
# Check for missing values
missing_values = data.isnull().sum()
missing_values
DRG Definition                          0
Provider Id                             0
Provider Name                           0
Provider Street Address                 0
Provider City                           0
Provider State                          0
Provider Zip Code                       0
Hospital Referral Region Description    0
 Total Discharges                       0
 Average Covered Charges                0
 Average Total Payments                 0
Average Medicare Payments               0
dtype: int64
# Remove white space from column names
data = data.rename(columns=lambda x: x.strip())
# Change data types
data['DRG Definition'] = data['DRG Definition'].astype('category')
data['Provider State'] = data['Provider State'].astype('category')
data['Provider City'] = data['Provider City'].astype('category')
data['Provider Id'] = data['Provider Id'].astype('category')
# Remove $ sign from columns and convert to float
data['Average Covered Charges'] = data['Average Covered Charges'].str.strip("$").astype(float)
data['Average Total Payments'] = data['Average Total Payments'].str.strip("$").astype(float)
data['Average Medicare Payments'] = data['Average Medicare Payments'].str.strip("$").astype(float)
# Rename columns to underscore-separated names
data.columns = ['DRG','Provider_Id', 'Provider_Name','Provider_StreetAddress','Provider_City',
'Provider_State','Provider_Zipcode','Hospital_referral_region_desp',
'Total_Discharges','Average_Covered_Charges','Average_Total_Payments',
'Average_Medicare_Payments']
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   DRG                            163065 non-null  category
 1   Provider_Id                    163065 non-null  category
 2   Provider_Name                  163065 non-null  object
 3   Provider_StreetAddress         163065 non-null  object
 4   Provider_City                  163065 non-null  category
 5   Provider_State                 163065 non-null  category
 6   Provider_Zipcode               163065 non-null  int64
 7   Hospital_referral_region_desp  163065 non-null  object
 8   Total_Discharges               163065 non-null  int64
 9   Average_Covered_Charges        163065 non-null  float64
 10  Average_Total_Payments         163065 non-null  float64
 11  Average_Medicare_Payments      163065 non-null  float64
dtypes: category(4), float64(3), int64(2), object(3)
memory usage: 11.1+ MB
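The `str.strip("$")` approach above works for this file, but it would fail on values containing thousands separators or stray whitespace. A more defensive variant, sketched here on hypothetical raw strings (not values from the CSV), strips all of `$`, commas, and whitespace before converting:

```python
import pandas as pd

# Hypothetical raw currency strings; the CSV's values have only a leading "$",
# but commas and padding are common in similar exports
raw = pd.Series(['$32963.07', '$1,234.50', ' $500 '])

# Remove $, commas, and whitespace in one regex pass, then convert to float
cleaned = raw.str.replace(r'[$,\s]', '', regex=True).astype(float)
print(cleaned.tolist())  # → [32963.07, 1234.5, 500.0]
```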
data.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | 5777.24 | 4763.73 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | 5787.57 | 4976.71 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | 5434.95 | 4453.79 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | 5417.56 | 4129.16 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | 5658.33 | 4851.44 |
unique_id = len(data.groupby('Provider_Id').count())
unique_providers = len(data.groupby('Provider_Name').count())
unique_states = len(data.groupby('Provider_State').count())
unique_cities = len(data.groupby('Provider_City').count())
unique_drg = len(data.groupby('DRG').count())
print(f'Unique ID = {unique_id}\n')
print(f'Unique Providers = {unique_providers}\n')
print(f'Unique States = {unique_states}\n')
print(f'Unique Cities = {unique_cities}\n')
print(f'Unique DRG = {unique_drg}\n')
Unique ID = 3337

Unique Providers = 3201

Unique States = 51

Unique Cities = 1977

Unique DRG = 100
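As a side note, `len(data.groupby(col).count())` works, but pandas provides `Series.nunique()` for exactly this purpose. A minimal sketch on toy data (hypothetical values, not rows from the dataset):

```python
import pandas as pd

# Toy frame with repeated providers (hypothetical values)
toy = pd.DataFrame({
    'Provider_Id':    [10001, 10001, 10005, 10006],
    'Provider_State': ['AL', 'AL', 'AL', 'GA'],
})

# nunique() returns the number of distinct values directly
print(toy['Provider_Id'].nunique())     # → 3
print(toy['Provider_State'].nunique())  # → 2
```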
data.describe()
| Provider_Zipcode | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payments | |
|---|---|---|---|---|---|
| count | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 |
| mean | 47938.121908 | 42.776304 | 36133.954224 | 9707.473804 | 8494.490964 |
| std | 27854.323080 | 51.104042 | 35065.365931 | 7664.642598 | 7309.467261 |
| min | 1040.000000 | 11.000000 | 2459.400000 | 2673.000000 | 1148.900000 |
| 25% | 27261.000000 | 17.000000 | 15947.160000 | 5234.500000 | 4192.350000 |
| 50% | 44309.000000 | 27.000000 | 25245.820000 | 7214.100000 | 6158.460000 |
| 75% | 72901.000000 | 49.000000 | 43232.590000 | 11286.400000 | 10056.880000 |
| max | 99835.000000 | 3383.000000 | 929118.900000 | 156158.180000 | 154620.810000 |
Section 2: EDA ¶
2.1 Distribution of Avg. Payments ¶
# Create a figure
fig = go.Figure()
# Add histogram for Average Total Payments
fig.add_trace(go.Histogram(x=data['Average_Total_Payments'], name='Average_Total_Payments', opacity=1))
fig.add_trace(go.Histogram(x=data['Average_Medicare_Payments'], name='Average_Medicare_Payments', opacity=0.85))
# Update layout
fig.update_layout(
title_text='Distribution of Average Total & Medicare Payments',
xaxis_title_text='Payments ($)',
yaxis_title_text='Frequency',
height=800, width=950,
template = 'seaborn',
plot_bgcolor = '#F6F5F5'
)
# Show plot
fig.show()
Based on the graph, it is notable that in places the average Medicare payment appears to exceed the average total payment, which is unusual and potentially concerning: by definition, Medicare payments are a component of total payments and should not exceed them. Further investigation is therefore necessary to identify and understand these discrepancies.
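One way to quantify that observation is to count the rows where the Medicare payment exceeds the total payment. A minimal sketch on toy rows (hypothetical values, assuming the cleaned column names from Section 1.2):

```python
import pandas as pd

# Toy rows mimicking the cleaned dataset (hypothetical values, not from the CSV)
toy = pd.DataFrame({
    'Average_Total_Payments':    [5777.24, 4000.00, 9000.00],
    'Average_Medicare_Payments': [4763.73, 4500.00, 8500.00],
})

# Medicare should not exceed the total; any hits are worth investigating
suspect = toy[toy['Average_Medicare_Payments'] > toy['Average_Total_Payments']]
print(len(suspect))  # → 1 (the second toy row)
```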
2.1.2 Distributions of Charges and Payments by States ¶
sns.pairplot(data[['Provider_State','Average_Total_Payments','Total_Discharges','Average_Medicare_Payments','Average_Covered_Charges']],
hue= 'Provider_State')
<seaborn.axisgrid.PairGrid at 0x1685d6dd0>
The above graph shows the distributions of the different charges and payments across states. Some outliers are already visible and warrant further investigation.
2.1.3 DRG Frequency ¶
Identify the most common DRGs by grouping by DRG and summing the total number of discharges, then sorting in descending order.
common_drg = data.groupby('DRG')
common_drg.agg({'Total_Discharges': 'sum'}).sort_values(by=['Total_Discharges'], ascending=False)[:20]
| Total_Discharges | |
|---|---|
| DRG | |
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | 427207 |
| 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC | 319072 |
| 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC | 244854 |
| 292 - HEART FAILURE & SHOCK W CC | 222038 |
| 690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC | 206695 |
| 194 - SIMPLE PNEUMONIA & PLEURISY W CC | 198390 |
| 291 - HEART FAILURE & SHOCK W MCC | 185599 |
| 641 - MISC DISORDERS OF NUTRITION,METABOLISM,FLUIDS/ELECTROLYTES W/O MCC | 153660 |
| 683 - RENAL FAILURE W CC | 150444 |
| 190 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W MCC | 149677 |
| 191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC | 148491 |
| 312 - SYNCOPE & COLLAPSE | 141918 |
| 603 - CELLULITIS W/O MCC | 140894 |
| 378 - G.I. HEMORRHAGE W CC | 138678 |
| 313 - CHEST PAIN | 131079 |
| 193 - SIMPLE PNEUMONIA & PLEURISY W MCC | 127989 |
| 287 - CIRCULATORY DISORDERS EXCEPT AMI, W CARD CATH W/O MCC | 115920 |
| 192 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W/O CC/MCC | 114790 |
| 310 - CARDIAC ARRHYTHMIA & CONDUCTION DISORDERS W/O CC/MCC | 113454 |
| 872 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC | 112430 |
drg470 = data[data['DRG'].str.contains('470')]
The most common procedure performed by hospitals was 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC. Below, I create a scatter plot to take a deeper look at the average Medicare payments for this group.
# Create scatter plot using Plotly Express with color mapped to 'Average_Covered_Charges'
fig = px.scatter(drg470, x='Provider_State', y='Average_Medicare_Payments',
color='Average_Medicare_Payments',
color_continuous_scale='PuBuGn',
title='Average Medicare Payments for DRG 470 by State',
labels={'Provider_State': 'State', 'Average_Medicare_Payments': 'Average Medicare Payments'})
# Customize layout
fig.update_layout(
xaxis_title='State',
yaxis_title='Average Medicare Payments',
template='plotly_white'
)
# Show the figure
fig.show()
For DRG 470, the average Medicare payments mostly fall within $10k to $20k, with some states, e.g. Maryland, California, and New York, exceeding $25k or even $30k.
Next, I'll look at how much each procedure cost in each state.
def stat(data, group_var, cont_var):
    '''
    Per-group descriptive statistics of cont_var, grouped by group_var.
    '''
    # groupby().describe() keeps the group labels as its index, so each row of
    # statistics stays aligned with the correct group. (Rebuilding the index
    # from drop_duplicates(), as previously done, mislabels rows whenever the
    # data's appearance order differs from groupby's sorted order.)
    return data.groupby(group_var)[cont_var].describe()
stat(data,'DRG','Average_Total_Payments').head(n=10)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 1079.0 | 6960.534004 | 1477.873952 | 4968.00 | 6001.8300 | 6582.890 | 7516.8250 | 18420.56 |
| 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC | 1201.0 | 6706.276445 | 2033.965862 | 4194.09 | 5412.8700 | 6093.750 | 7345.3600 | 25519.43 |
| 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC | 1659.0 | 13263.823032 | 3847.918207 | 8174.28 | 10762.2200 | 12084.700 | 14424.3250 | 50882.40 |
| 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC | 2269.0 | 7922.671141 | 2084.658336 | 5368.73 | 6626.2700 | 7280.050 | 8503.0600 | 26510.15 |
| 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC | 1806.0 | 5713.985221 | 1342.538675 | 3916.41 | 4819.3250 | 5326.025 | 6197.4800 | 14744.05 |
| 069 - TRANSIENT ISCHEMIA | 1962.0 | 5068.644292 | 1114.407934 | 3514.41 | 4320.7375 | 4740.235 | 5459.7375 | 12312.80 |
| 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC | 979.0 | 6386.793177 | 1597.733831 | 4344.17 | 5278.4200 | 5903.250 | 7060.7800 | 15759.14 |
| 101 - SEIZURES W/O MCC | 1593.0 | 5493.361883 | 1420.448742 | 3704.89 | 4530.4100 | 5058.160 | 6013.8600 | 15554.42 |
| 149 - DYSEQUILIBRIUM | 988.0 | 4589.104342 | 1082.833515 | 3185.77 | 3826.4325 | 4285.315 | 5036.1925 | 9325.21 |
| 176 - PULMONARY EMBOLISM W/O MCC | 1396.0 | 7279.954979 | 1544.880403 | 4891.30 | 6238.9800 | 6848.800 | 7933.6125 | 15783.05 |
stat(data,'DRG','Average_Medicare_Payments').head(n=10)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 1079.0 | 5555.837525 | 1236.063439 | 3592.85 | 4759.8000 | 5269.280 | 6033.6100 | 15855.18 |
| 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC | 1201.0 | 5701.676570 | 1973.620083 | 3116.42 | 4467.6600 | 5104.780 | 6293.6000 | 22873.49 |
| 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC | 1659.0 | 12112.849445 | 3702.047762 | 6603.63 | 9765.6300 | 10956.160 | 13094.6300 | 48632.28 |
| 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC | 2269.0 | 6781.501785 | 1979.482762 | 4109.25 | 5555.9100 | 6167.510 | 7342.0300 | 23402.26 |
| 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC | 1806.0 | 4469.203560 | 1271.852222 | 2771.88 | 3653.9750 | 4079.575 | 4862.4650 | 13710.23 |
| 069 - TRANSIENT ISCHEMIA | 1962.0 | 3952.275102 | 1094.682773 | 2348.07 | 3240.5050 | 3627.465 | 4298.7650 | 9959.58 |
| 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC | 979.0 | 5238.579438 | 1542.402076 | 2456.83 | 4201.5650 | 4815.660 | 5862.4700 | 13310.28 |
| 101 - SEIZURES W/O MCC | 1593.0 | 4452.600709 | 1379.483972 | 2629.69 | 3535.5800 | 4012.660 | 4966.9500 | 13961.18 |
| 149 - DYSEQUILIBRIUM | 988.0 | 3440.677065 | 1076.033108 | 2012.35 | 2717.5875 | 3145.555 | 3805.2575 | 8615.31 |
| 176 - PULMONARY EMBOLISM W/O MCC | 1396.0 | 6077.948660 | 1505.418988 | 3561.68 | 5122.6200 | 5644.930 | 6554.8175 | 15161.27 |
2.1.4 Choropleth Map ¶
def plot_map(data, group_var, cont_var, stat='mean', color='RdBu_r', label=''):
# Grouping the Data
if stat == 'count':
x = data.groupby(group_var).size() # Count number of rows per group
else:
x = data.groupby(group_var)[cont_var].agg(stat) # Compute specified statistic
# Plotting Heat Map
fig = go.Figure(go.Choropleth(
locations=list(x.index),
z=x.astype(float),
locationmode='USA-states',
colorscale=color,
colorbar_title=f'{stat.capitalize()} {label.lower()}',
))
fig.update_layout(
title_text=f'{cont_var.replace("_", " ").title()} Per State',
geo_scope='usa',
height=800, width=950
)
return fig
# Plot the distribution of variables by states using choropleth map
p = [
plot_map(data, "Provider_State", "Total_Discharges", stat='count'),
plot_map(data, "Provider_State", "Average_Covered_Charges", stat='mean', label="Covered Charges"),
plot_map(data, "Provider_State", "Average_Total_Payments", stat='mean', label="Total Payments"),
plot_map(data, "Provider_State", "Average_Medicare_Payments", stat='mean', label="Medicare Payments")
]
for fig in p:
fig.show()
Section 3: Feature Engineering ¶
3.1 Average Medicare Payments ¶
For this section, benchmarks are calculated by grouping by DRG, provider state, city, and zip code, then taking the mean of the Average Medicare Payments.
Four features are created in this section: Average Medicare Payments by DRG, State, City & Zip Code. They provide insights into healthcare payment patterns across various levels of granularity.
3.1.1 By DRG ¶
table_1 = data.groupby(['DRG'])['Average_Medicare_Payments'].mean().reset_index()
table_1 = table_1.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Payments_byDRG'})
table_1.head(10)
| DRG | Average_Medicare_Payments_byDRG | |
|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 5555.837525 |
| 1 | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 5701.676570 |
| 2 | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 12112.849445 |
| 3 | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 6781.501785 |
| 4 | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 4469.203560 |
| 5 | 069 - TRANSIENT ISCHEMIA | 3952.275102 |
| 6 | 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O... | 5238.579438 |
| 7 | 101 - SEIZURES W/O MCC | 4452.600709 |
| 8 | 149 - DYSEQUILIBRIUM | 3440.677065 |
| 9 | 176 - PULMONARY EMBOLISM W/O MCC | 6077.948660 |
To better identify outliers, additional quantiles (90% and 95%) are added using the following function:¶
def calculate_descriptive_stats(data, column):
    '''
    Calculate descriptive statistics with additional 90th and 95th percentiles.
    '''
# Calculate basic descriptive statistics
desc_stats = data[column].describe()
# Calculate the 90th and 95th percentiles
perc_90 = data[column].quantile(0.90)
perc_95 = data[column].quantile(0.95)
# Convert the descriptive statistics to a DataFrame
desc_stats = desc_stats.to_frame().transpose()
# Add the 90th and 95th percentiles to the DataFrame
desc_stats['90%'] = perc_90
desc_stats['95%'] = perc_95
# Reorder the columns to match the desired format
desc_stats = desc_stats[['mean', 'min', '25%', '50%', '75%', '90%', '95%', 'max']]
# Transpose the DataFrame
desc_stats = desc_stats.T
return desc_stats
table_1_stats = calculate_descriptive_stats(table_1, 'Average_Medicare_Payments_byDRG')
table_1_stats
| Average_Medicare_Payments_byDRG | |
|---|---|
| mean | 9248.575795 |
| min | 2876.015593 |
| 25% | 4465.052848 |
| 50% | 6391.828157 |
| 75% | 11444.865846 |
| 90% | 17047.411069 |
| 95% | 21501.661996 |
| max | 41899.432929 |
Based on the above table, payments exceeding the 95th-percentile threshold could be considered anomalies.
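The same cutoff can be applied programmatically; a sketch on hypothetical benchmark values (stand-ins for the real per-DRG numbers in `table_1`):

```python
import pandas as pd

# Hypothetical per-DRG benchmarks standing in for Average_Medicare_Payments_byDRG
bench = pd.Series([3000.0, 5000.0, 6000.0, 7000.0, 9000.0,
                   11000.0, 15000.0, 20000.0, 26000.0, 42000.0])

threshold = bench.quantile(0.95)      # 95th-percentile cutoff
anomalies = bench[bench > threshold]  # benchmarks above the cutoff
print(len(anomalies))  # → 1
```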
# Add average medicare payments by DRG to the dataset
df1 = pd.merge(data, table_1, how='left', on=['DRG'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by DRG
df1['Average_Medicare_byDRG_Ratio'] = np.where(df1['Average_Medicare_Payments_byDRG']==0, 0, df1['Average_Medicare_Payments'] / df1['Average_Medicare_Payments_byDRG'])
Distribution of Average Medicare Payments By DRG Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_byDRG_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments By DRG Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature focuses on understanding how Medicare payments vary across different DRGs. It helps in identifying anomalies such as DRGs with unusually high or low Medicare payments compared to the overall average.
3.1.2 Per DRG by State ¶
table_2 = data.groupby(['DRG', 'Provider_State'])['Average_Medicare_Payments'].mean().reset_index()
table_2.columns = ['DRG', 'Provider_State', 'Average_Medicare_Bystates']
table_2.head(10)
| DRG | Provider_State | Average_Medicare_Bystates | |
|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AK | 6413.780000 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AL | 4599.593043 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AR | 4938.712500 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AZ | 5912.832917 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CA | 6878.954925 |
| 5 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CO | 5419.930000 |
| 6 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CT | 6709.101333 |
| 7 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DC | 8110.816667 |
| 8 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DE | 5719.316667 |
| 9 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | FL | 4868.594568 |
table_2_stats = calculate_descriptive_stats(table_2, 'Average_Medicare_Bystates')
table_2_stats
| Average_Medicare_Bystates | |
|---|---|
| mean | 9289.670980 |
| min | 2216.372500 |
| 25% | 4483.274762 |
| 50% | 6635.284161 |
| 75% | 11044.008750 |
| 90% | 18103.081589 |
| 95% | 24208.021478 |
| max | 63646.045000 |
# Add average medicare payments by states to the dataset
df2 = pd.merge(data, table_2, how='left', on=['DRG', 'Provider_State'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df2['Average_Medicare_Bystates_Ratio'] = np.where(df2['Average_Medicare_Bystates']==0,0, df2['Average_Medicare_Payments'] / df2['Average_Medicare_Bystates'])
df2['Average_Medicare_Bystates_Ratio'].describe()
count    163065.000000
mean          1.000000
std           0.215091
min           0.339831
25%           0.868395
50%           0.954250
75%           1.071483
max           5.174415
Name: Average_Medicare_Bystates_Ratio, dtype: float64
Distribution of Average Medicare Payments Per DRG By State Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df2[binned_var] = pd.qcut(df2[var], percentile)
# Create the bin labels
df2['bin_label'] = df2[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df2['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments State Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature adds state-level granularity to the analysis, allowing regional disparities in Medicare payments to be identified for each DRG. It helps in understanding variations in healthcare costs influenced by state-level policies, demographics, and healthcare infrastructure. Payments falling within the highest ratio range, i.e. [1.74, 5.17], could be flagged as anomalies.
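A sketch of flagging the top tail of such a ratio, using simulated values (hypothetical ratios generated around 1.0, like `Average_Medicare_Bystates_Ratio`, rather than the real column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical payment-to-benchmark ratios centered on 1.0
ratios = pd.Series(rng.normal(loc=1.0, scale=0.2, size=1000))

# Flag everything beyond the 99th percentile, the last qcut bin edge above
cutoff = ratios.quantile(0.99)
flagged = int((ratios > cutoff).sum())
print(flagged)  # → 10 (the top 1% of 1000 rows)
```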
3.1.3 Per DRG by State & City ¶
table_3 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Medicare_Payments'].mean().reset_index()
table_3 = table_3.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Bystatescity'}).sort_values(by='Average_Medicare_Bystatescity', ascending=False)
table_3.head(10)
| DRG | Provider_State | Provider_City | Average_Medicare_Bystatescity | |
|---|---|---|---|---|
| 2126950 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | STANFORD | 154620.81 |
| 9244283 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | NY | VALHALLA | 133177.26 |
| 2125896 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | FREMONT | 113462.09 |
| 9285667 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | CA | STANFORD | 109303.21 |
| 9345110 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | NY | VALHALLA | 107456.18 |
| 5312030 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | NY | VALHALLA | 100040.26 |
| 6831303 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | PA | HERSHEY | 99114.86 |
| 2186393 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | NY | VALHALLA | 98792.46 |
| 9184840 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | CA | STANFORD | 94614.03 |
| 9361733 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | TX | GALVESTON | 89857.83 |
table_3_stats = calculate_descriptive_stats(table_3, 'Average_Medicare_Bystatescity')
table_3_stats
| Average_Medicare_Bystatescity | |
|---|---|
| mean | 8133.157995 |
| min | 1148.900000 |
| 25% | 4070.555000 |
| 50% | 5925.720000 |
| 75% | 9701.325000 |
| 90% | 14810.964000 |
| 95% | 20455.067000 |
| max | 154620.810000 |
# Add average medicare payments by state & city to the dataset
df3 = pd.merge(data, table_3, how='left', on=['DRG', 'Provider_State', 'Provider_City'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by state & city
df3['Average_Medicare_Bystatescity_Ratio'] = np.where(df3['Average_Medicare_Bystatescity']==0, 0, df3['Average_Medicare_Payments'] / df3['Average_Medicare_Bystatescity'])
Distribution of Average Medicare Payments By State and City Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df3[binned_var] = pd.qcut(df3[var], percentile, duplicates='drop')
# Create the bin labels
df3['bin_label'] = df3[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df3['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments By State and City Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df3[var], 95)
# Create anomaly indicator
df3['Anomaly'] = np.where(df3[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df3['Anomaly'].value_counts()
anomaly_counts
Anomaly
0    154911
1      8154
Name: count, dtype: int64
Further disaggregating by city within each state provides insight into localized healthcare payment patterns. This feature identifies cities where Medicare payments differ markedly across DRGs, indicating potential anomalies in healthcare service delivery or cost structures at the city level.
3.1.4 Average Medicare Payments per DRG by Zip Code ¶
table_4 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Medicare_Payments'].mean().reset_index()
table_4 = table_4.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Byzip'}).sort_values(by='Average_Medicare_Byzip', ascending=False)
table_4.head(10)
| DRG | Provider_Zipcode | Average_Medicare_Byzip | |
|---|---|---|---|
| 66990 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94305 | 154620.81 |
| 278041 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | 10595 | 133177.26 |
| 66996 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94538 | 113462.09 |
| 283753 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 94305 | 109303.21 |
| 281094 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 10595 | 107456.18 |
| 283585 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 90095 | 101717.82 |
| 158974 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 10595 | 100040.26 |
| 204951 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 17033 | 99114.86 |
| 64331 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 10595 | 98792.46 |
| 278317 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | 20060 | 95701.42 |
table_4_stats = calculate_descriptive_stats(table_4, 'Average_Medicare_Byzip')
table_4_stats
| Average_Medicare_Byzip | |
|---|---|
| mean | 8455.257017 |
| min | 1148.900000 |
| 25% | 4172.160000 |
| 50% | 6125.210000 |
| 75% | 10014.280000 |
| 90% | 15625.386000 |
| 95% | 21653.620000 |
| max | 154620.810000 |
# Add average medicare payments by zip code to the dataset
df4 = pd.merge(data, table_4, how='left', on=['DRG', 'Provider_Zipcode'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by zip code
df4['Average_Medicare_Byzip_Ratio'] = np.where(df4['Average_Medicare_Byzip']==0, 0, df4['Average_Medicare_Payments'] / df4['Average_Medicare_Byzip'])
Distribution of Average Medicare Payments By Zip Code Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df4[binned_var] = pd.qcut(df4[var], percentile, duplicates='drop')
# Create the bin labels
df4['bin_label'] = df4[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df4['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments By Zip Code Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df4[var], 95)
# Create anomaly indicator
df4['Anomaly'] = np.where(df4[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df4['Anomaly'].value_counts()
anomaly_counts
Anomaly
0    155070
1      7995
Name: count, dtype: int64
Anomaly counts are computed mainly at the 'by state & city' and 'by zip code' levels, because those two levels provide more in-depth insights.
Zip code-level granularity offers the most detailed view of Medicare payment variation within a city. It identifies zip codes where Medicare payments deviate significantly from others in the same city and state, pinpointing localized anomalies or disparities in healthcare utilization and costs.
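As an aside on the `duplicates='drop'` argument passed to `pd.qcut` above: a large share of rows have a ratio of exactly 1.00 (e.g. a provider that is the only one in its DRG/zip group is its own benchmark), so several quantile edges coincide and the duplicate edges must be dropped, merging bins. A toy illustration of that behavior:

```python
import pandas as pd

# Many identical values (ratio == 1.0) make several quantile edges coincide
s = pd.Series([1.0] * 8 + [0.5, 2.0])
bins = pd.qcut(s, [0, 0.25, 0.5, 0.75, 1], duplicates='drop')

# The edges [0.5, 1.0, 1.0, 1.0, 2.0] collapse to [0.5, 1.0, 2.0]: two bins remain
print(bins.cat.categories.size)
```

Without `duplicates='drop'`, `pd.qcut` raises a `ValueError` on the duplicate edges, which is why some of the binning cells in this notebook need the argument and others do not.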
3.2 Average Total Payments ¶
For this section, benchmarks are calculated by grouping the Average Total Payments by DRG together with provider state, city, and zip code, then taking the mean within each group.
Three features are created in this section: Average Total Payments per DRG by state, by state & city, and by zip code. Together they provide comprehensive insight into healthcare expenditure patterns at several levels of granularity.
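The benchmark-and-ratio pattern repeated throughout Sections 3.1-3.3 can be factored into a single helper. A minimal sketch on toy data (the helper name `add_benchmark_ratio` is illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

def add_benchmark_ratio(df, group_cols, value_col, suffix):
    # Benchmark = mean of value_col within each group
    bench_col = f'{value_col}_{suffix}'
    bench = (df.groupby(group_cols)[value_col].mean()
               .reset_index()
               .rename(columns={value_col: bench_col}))
    out = df.merge(bench, on=group_cols, how='left')
    # Ratio of each row's value to its group benchmark (guarding against /0)
    out[f'{bench_col}_Ratio'] = np.where(
        out[bench_col] == 0, 0, out[value_col] / out[bench_col])
    return out

toy = pd.DataFrame({
    'DRG': ['039', '039', '039'],
    'Provider_State': ['AL', 'AL', 'AK'],
    'Average_Total_Payments': [100.0, 300.0, 500.0],
})
res = add_benchmark_ratio(toy, ['DRG', 'Provider_State'],
                          'Average_Total_Payments', 'Bystate')
# AL benchmark is 200, giving ratios 0.5 and 1.5; the lone AK row is its own benchmark (1.0)
```

A ratio near 1.0 means the provider matches its peer group; ratios far above 1.0 are the candidates flagged later via the 95th-percentile cutoff.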
3.2.1 Per DRG by State ¶
table_5 = data.groupby(['DRG', 'Provider_State'])['Average_Total_Payments'].mean().reset_index()
table_5 = table_5.rename(columns={'Average_Total_Payments': 'Average_Total_Bystate'}).sort_values(by='Average_Total_Bystate', ascending=False)
table_5.head(10)
| | DRG | Provider_State | Average_Total_Bystate |
|---|---|---|---|
| 2652 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | AK | 68006.425000 |
| 4648 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | DC | 66437.692500 |
| 4692 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | AK | 61300.630000 |
| 4641 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | AK | 57238.680000 |
| 4652 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | HI | 56482.922000 |
| 4726 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | NY | 56465.890641 |
| 1071 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | AK | 54385.475000 |
| 4696 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | CA | 53733.333710 |
| 4738 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | VT | 52761.330000 |
| 4699 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | DC | 52491.154000 |
table_5_stats = calculate_descriptive_stats(table_5, 'Average_Total_Bystate')
table_5_stats
| | Average_Total_Bystate |
|---|---|
| mean | 10640.517669 |
| min | 3236.026333 |
| 25% | 5583.888889 |
| 50% | 7780.726250 |
| 75% | 12493.163000 |
| 90% | 19845.450487 |
| 95% | 27083.356325 |
| max | 68006.425000 |
# Add average total payments by state to the dataset
df5 = pd.merge(data, table_5, how='left', on=['DRG', 'Provider_State'])
# Calculate the ratio of the Average total payments to the Average total payments by state
df5['Average_Total_Bystates_Ratio'] = np.where(df5['Average_Total_Bystate']==0,0, df5['Average_Total_Payments'] / df5['Average_Total_Bystate'])
Distribution of Average Total Payments By State Ratio¶
# Define variables and quantile range
var = 'Average_Total_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df5[binned_var] = pd.qcut(df5[var], percentile)
# Create the bin labels
df5['bin_label'] = df5[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df5['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)
# Update the layout
fig.update_layout(
title='Distribution of Average Total Payments By State Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature adds state-level granularity to the analysis, identifying regional variations in total payments for each DRG.
3.2.2 Per DRG by State & City ¶
table_6 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Total_Payments'].mean().reset_index()
table_6 = table_6.rename(columns={'Average_Total_Payments': 'Average_Total_Bystatescity'}).sort_values(by='Average_Total_Bystatescity', ascending=False)
table_6.head(10)
| | DRG | Provider_State | Provider_City | Average_Total_Bystatescity |
|---|---|---|---|---|
| 2126950 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | STANFORD | 156158.18 |
| 9244283 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | NY | VALHALLA | 140255.26 |
| 2125896 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | FREMONT | 119113.00 |
| 9345110 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | NY | VALHALLA | 119028.90 |
| 9285667 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | CA | STANFORD | 109945.57 |
| 5312030 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | NY | VALHALLA | 101796.26 |
| 6831303 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | PA | HERSHEY | 100018.33 |
| 2186393 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | NY | VALHALLA | 99929.46 |
| 9184840 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | CA | STANFORD | 95302.74 |
| 9361733 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | TX | GALVESTON | 90701.58 |
table_6_stats = calculate_descriptive_stats(table_6, 'Average_Total_Bystatescity')
table_6_stats
| | Average_Total_Bystatescity |
|---|---|
| mean | 9316.580074 |
| min | 2673.000000 |
| 25% | 5105.865000 |
| 50% | 6969.000000 |
| 75% | 10860.125000 |
| 90% | 16328.638000 |
| 95% | 22435.810500 |
| max | 156158.180000 |
# Add average total payments by state & city to the dataset
df6 = pd.merge(data, table_6, how='left', on=['DRG', 'Provider_State', 'Provider_City'])
# Calculate the ratio of the Average total payments to the Average total payments by state & city
df6['Average_Total_Bystatescity_Ratio'] = np.where(df6['Average_Total_Bystatescity']==0,0, df6['Average_Total_Payments'] / df6['Average_Total_Bystatescity'])
Distribution of Average Total Payments By State and City Ratio¶
# Define variables and quantile range
var = 'Average_Total_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df6[binned_var] = pd.qcut(df6[var], percentile, duplicates='drop')
# Create the bin labels
df6['bin_label'] = df6[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df6['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)
# Update the layout
fig.update_layout(
title='Distribution of Average Total Payments By State and City Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df6[var], 95)
# Create anomaly indicator
df6['Anomaly'] = np.where(df6[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df6['Anomaly'].value_counts()
anomaly_counts
Anomaly
0    154911
1      8154
Name: count, dtype: int64
Similar to the Average Medicare Payments features, this feature adds an additional level of granularity for detecting anomalies.
3.2.3 Per DRG by Zip Code ¶
table_7 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Total_Payments'].mean().reset_index()
table_7 = table_7.rename(columns={'Average_Total_Payments': 'Average_Total_Byzip'}).sort_values(by='Average_Total_Byzip', ascending=False)
table_7.head(10)
| | DRG | Provider_Zipcode | Average_Total_Byzip |
|---|---|---|---|
| 66990 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94305 | 156158.18 |
| 278041 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | 10595 | 140255.26 |
| 66996 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94538 | 119113.00 |
| 281094 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 10595 | 119028.90 |
| 283753 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 94305 | 109945.57 |
| 283585 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 90095 | 103802.96 |
| 158974 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 10595 | 101796.26 |
| 204951 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 17033 | 100018.33 |
| 64331 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 10595 | 99929.46 |
| 159489 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 29220 | 99307.30 |
table_7_stats = calculate_descriptive_stats(table_7, 'Average_Total_Byzip')
table_7_stats
| | Average_Total_Byzip |
|---|---|
| mean | 9665.549998 |
| min | 2682.640000 |
| 25% | 5215.000000 |
| 50% | 7178.760000 |
| 75% | 11233.000000 |
| 90% | 17190.889000 |
| 95% | 23786.226000 |
| max | 156158.180000 |
# Add average total payments by zip code to the dataset
df7 = pd.merge(data, table_7, how='left', on=['DRG', 'Provider_Zipcode'])
# Calculate the ratio of the Average total payments to the Average total payments by zip code
df7['Average_Total_Byzip_Ratio'] = np.where(df7['Average_Total_Byzip']==0,0, df7['Average_Total_Payments'] / df7['Average_Total_Byzip'])
Distribution of Average Total Payments By Zip Code Ratio¶
# Define variables and quantile range
var = 'Average_Total_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df7[binned_var] = pd.qcut(df7[var], percentile, duplicates='drop')
# Create the bin labels
df7['bin_label'] = df7[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df7['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)
# Update the layout
fig.update_layout(
title='Distribution of Average Total Payments By Zip Code Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df7[var], 95)
# Create anomaly indicator
df7['Anomaly'] = np.where(df7[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df7['Anomaly'].value_counts()
anomaly_counts
Anomaly
0    155056
1      8009
Name: count, dtype: int64
Similar to the Average Medicare Payments features, this feature adds a further level of granularity.
This section on Average Total Payments covers all costs associated with healthcare services, offering a broader perspective on anomalies related to overall service costs, resource use, and patient payments.
3.3 Average Out of Pocket Payments ¶
For this section, benchmarks are calculated by subtracting Medicare payments from total payments to obtain the Average Out-Of-Pocket (OOP) payments of patients, then grouping by DRG together with provider state, city, and zip code and taking the mean.
Three features are created in this section: Average OOP Payments per DRG by state, by state & city, and by zip code. They provide insight into healthcare expenditure patterns from the patient's perspective across several levels of granularity; a hospital charging a much higher-than-average OOP cost could be considered an anomaly.
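As a worked example of the OOP definition, the first row of the dataset (Southeast Alabama Medical Center, Dothan, AL) gives:

```python
# OOP = Average Total Payments - Average Medicare Payments
average_total_payments = 5777.24
average_medicare_payments = 4763.73

average_oop = average_total_payments - average_medicare_payments
print(round(average_oop, 2))  # 1013.51, matching the Average_OOP column below
```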
3.3.1 Per DRG by State ¶
data['Average_OOP'] = data['Average_Total_Payments'] - data['Average_Medicare_Payments']
data.head()
| | DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payments | Average_OOP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | 5777.24 | 4763.73 | 1013.51 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | 5787.57 | 4976.71 | 810.86 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | 5434.95 | 4453.79 | 981.16 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | 5417.56 | 4129.16 | 1288.40 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | 5658.33 | 4851.44 | 806.89 |
table_8 = data.groupby(['DRG', 'Provider_State'])['Average_OOP'].mean().reset_index()
table_8 = table_8.rename(columns={'Average_OOP': 'Average_OOP_Bystates'})
table_8.head(10)
| | DRG | Provider_State | Average_OOP_Bystates |
|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AK | 1988.170000 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AL | 1144.018696 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AR | 1180.116250 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AZ | 1358.722917 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CA | 1508.977761 |
| 5 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CO | 1631.992000 |
| 6 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CT | 1301.466000 |
| 7 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DC | 1383.156667 |
| 8 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DE | 1408.896667 |
| 9 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | FL | 1334.597778 |
table_8_stats = calculate_descriptive_stats(table_8, 'Average_OOP_Bystates')
table_8_stats
| | Average_OOP_Bystates |
|---|---|
| mean | 1350.846689 |
| min | 359.710000 |
| 25% | 957.941455 |
| 50% | 1105.953333 |
| 75% | 1424.579545 |
| 90% | 2109.421204 |
| 95% | 2687.367642 |
| max | 19964.515000 |
# Add average OOP payments by state to the dataset
df8 = pd.merge(data, table_8, how='left', on=['DRG', 'Provider_State'])
# Calculate the ratio of the Average OOP payments to the Average OOP payments by state
df8['Average_OOP_Bystates_Ratio'] = np.where(df8['Average_OOP_Bystates']==0,0, df8['Average_OOP'] / df8['Average_OOP_Bystates'])
Distribution of Average Out-Of-Pocket Payments By State Ratio¶
# Define variables and quantile range
var = 'Average_OOP_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df8[binned_var] = pd.qcut(df8[var], percentile)
# Create the bin labels
df8['bin_label'] = df8[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df8['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor_r)
# Update the layout
fig.update_layout(
title='Distribution of Average Out-Of-Pocket Payments By State Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature provides insight into the disparities in patient expenses across states. Payments falling into the highest bins, i.e. [1.91 - 3.20] and [3.20 - 38.19], could be considered anomalies.
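The 95th-percentile flag applied in the neighboring subsections works the same way here. A self-contained sketch on synthetic ratios (the lognormal draw is a toy stand-in for the real ratio column, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = rng.lognormal(mean=0.0, sigma=0.3, size=1000)  # toy ratio column

# Flag the top 5% of ratios as anomalous
threshold = np.percentile(ratios, 95)
anomaly = (ratios > threshold).astype(int)

print(anomaly.sum())  # 50 of 1000 points fall above the cutoff
```

By construction this flags a fixed share of the data (roughly 5%), so the cutoff controls the review workload rather than estimating a true anomaly rate.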
3.3.2 Per DRG by State & City ¶
table_9 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_OOP'].mean().reset_index()
table_9 = table_9.rename(columns={'Average_OOP': 'Average_OOP_Bystatescity'}).sort_values(by='Average_OOP_Bystatescity', ascending=False)
table_9.head(10)
| | DRG | Provider_State | Provider_City | Average_OOP_Bystatescity |
|---|---|---|---|---|
| 2916206 | 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... | WA | BELLEVUE | 75998.660 |
| 6748071 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | WA | FEDERAL WAY | 51457.650 |
| 3054632 | 252 - OTHER VASCULAR PROCEDURES W MCC | IN | BLOOMINGTON | 41686.250 |
| 2054178 | 203 - BRONCHITIS & ASTHMA W/O CC/MCC | MA | ATTLEBORO | 39691.910 |
| 2805011 | 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | SD | SIOUX FALLS | 37708.605 |
| 2682933 | 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | NJ | BERLIN | 37034.820 |
| 6723696 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | NY | WEST ISLIP | 35437.920 |
| 1286505 | 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | PA | READING | 33117.260 |
| 5329100 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | TX | MCKINNEY | 32245.250 |
| 2551805 | 244 - PERMANENT CARDIAC PACEMAKER IMPLANT W/O ... | IN | RICHMOND | 32146.450 |
table_9_stats = calculate_descriptive_stats(table_9, 'Average_OOP_Bystatescity')
table_9_stats
| | Average_OOP_Bystatescity |
|---|---|
| mean | 1183.422079 |
| min | 0.000000 |
| 25% | 783.055000 |
| 50% | 938.000000 |
| 75% | 1204.592778 |
| 90% | 1846.114000 |
| 95% | 2579.456000 |
| max | 75998.660000 |
# Add average OOP payments by state & city to the dataset
df9 = pd.merge(data, table_9, how='left', on=['DRG', 'Provider_State', 'Provider_City'])
# Calculate the ratio of the Average OOP payments to the Average OOP payments by state & city
df9['Average_OOP_Bystatescity_Ratio'] = np.where(df9['Average_OOP_Bystatescity']==0,0, df9['Average_OOP'] / df9['Average_OOP_Bystatescity'])
Distribution of Average Out-Of-Pocket Payments By State & City Ratio¶
# Define variables and quantile range
var = 'Average_OOP_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df9[binned_var] = pd.qcut(df9[var], percentile, duplicates='drop')
# Create the bin labels
df9['bin_label'] = df9[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df9['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)
# Update the layout
fig.update_layout(
title='Distribution of Average Out-Of-Pocket Payments By State & City Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df9[var], 95)
# Create anomaly indicator
df9['Anomaly'] = np.where(df9[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df9['Anomaly'].value_counts()
anomaly_counts
Anomaly
0    154911
1      8154
Name: count, dtype: int64
Similar to the previous features, this adds city-level granularity. Payments falling into the highest bins could be flagged as anomalies.
3.3.3 Per DRG by Zip Code ¶
table_10 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_OOP'].mean().reset_index()
table_10 = table_10.rename(columns={'Average_OOP': 'Average_OOP_Byzip'}).sort_values(by='Average_OOP_Byzip', ascending=False)
table_10.head(10)
| | DRG | Provider_Zipcode | Average_OOP_Byzip |
|---|---|---|---|
| 88487 | 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... | 98004 | 75998.66 |
| 84205 | 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | 57108 | 74168.03 |
| 37116 | 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | 19605 | 65079.84 |
| 204500 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | 98003 | 51457.65 |
| 65978 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 60657 | 41849.50 |
| 93126 | 252 - OTHER VASCULAR PROCEDURES W MCC | 47403 | 41686.25 |
| 61115 | 203 - BRONCHITIS & ASTHMA W/O CC/MCC | 2703 | 39691.91 |
| 79549 | 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | 8009 | 37034.82 |
| 201759 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | 11795 | 35437.92 |
| 90311 | 251 - PERC CARDIOVASC PROC W/O CORONARY ARTERY... | 57108 | 34721.65 |
table_10_stats = calculate_descriptive_stats(table_10, 'Average_OOP_Byzip')
table_10_stats
| | Average_OOP_Byzip |
|---|---|
| mean | 1210.292981 |
| min | 0.000000 |
| 25% | 778.770000 |
| 50% | 938.000000 |
| 75% | 1222.000000 |
| 90% | 1923.914000 |
| 95% | 2715.580000 |
| max | 75998.660000 |
# Add average OOP payments by zip code to the dataset
df10 = pd.merge(data, table_10, how='left', on=['DRG', 'Provider_Zipcode'])
# Calculate the ratio of the Average OOP payments to the Average OOP payments by zip code
df10['Average_OOP_Byzip_Ratio'] = np.where(df10['Average_OOP_Byzip']==0,0, df10['Average_OOP'] / df10['Average_OOP_Byzip'])
Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio¶
# Define variables and quantile range
var = 'Average_OOP_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df10[binned_var] = pd.qcut(df10[var], percentile, duplicates='drop')
# Create the bin labels
df10['bin_label'] = df10[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df10['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)
# Update the layout
fig.update_layout(
title='Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df10[var], 95)
# Create anomaly indicator
df10['Anomaly'] = np.where(df10[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df10['Anomaly'].value_counts()
anomaly_counts
Anomaly
0    155070
1      7995
Name: count, dtype: int64
Similar to the previous features, this adds zip code-level granularity. Payments falling into the highest bins could be flagged as anomalies.
Data Preparation for Modeling¶
# Append the new features back to the original dataset for further analysis
# Make a copy of the original DataFrame
data_copy = data.copy()
# List of DataFrames
dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10]
# Function to filter and merge only the 'Ratio' columns
def merge_ratio_columns(left, right):
ratio_columns = [col for col in right.columns if col.endswith('Ratio')]
return pd.merge(left, right[['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns],
on=['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'], how='left')
# Use reduce to iteratively merge all DataFrames with the ratio columns into df_copy
final_df = reduce(merge_ratio_columns, dfs, data_copy)
# Display the final merged DataFrame
final_df.head()
| | DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | ... | 0.857428 | 1.035685 | 1.038763 | 1.0 | 1.005855 | 1.037810 | 1.0 | 0.885921 | 1.033356 | 1.0 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | ... | 0.895762 | 1.081989 | 1.000000 | 1.0 | 1.007653 | 1.000000 | 1.0 | 0.708782 | 1.000000 | 1.0 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | ... | 0.801642 | 0.968301 | 1.000000 | 1.0 | 0.946260 | 1.000000 | 1.0 | 0.857643 | 1.000000 | 1.0 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | ... | 0.743211 | 0.897723 | 0.899200 | 1.0 | 0.943232 | 0.915208 | 1.0 | 1.126205 | 0.970586 | 1.0 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | ... | 0.873215 | 1.054754 | 1.000000 | 1.0 | 0.985152 | 1.000000 | 1.0 | 0.705312 | 1.000000 | 1.0 |
5 rows × 23 columns
# Extracting only the ratio columns
ratio_columns = [col for col in final_df.columns if col.endswith('Ratio')]
X = final_df[ratio_columns]
# Splitting data into train and test sets for X (features only)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=123)
Section 4: K-Nearest Neighbor (KNN) ¶
4.1 KNN ¶
KNN identifies anomalies by comparing the distance of each data point to its nearest neighbors; points that are far from their neighbors are considered outliers. The approach is non-parametric and easy to implement and understand, but it is computationally intensive for large datasets and sensitive to noise, which can lead to incorrect outlier detection.
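The mechanism can be sketched in plain NumPy, independent of PyOD (a toy illustration, not PyOD's implementation): score each point by the distance to its k-th nearest neighbor, mirroring PyOD's default 'largest' method, then flag the top 5% of scores.

```python
import numpy as np

def knn_scores(X, k=5):
    # Pairwise Euclidean distances between all points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    # Anomaly score = distance to the k-th nearest neighbor
    return np.sort(d, axis=1)[:, k - 1]

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[0] = [10.0, 10.0, 10.0]                  # plant one obvious outlier

scores = knn_scores(X, k=5)
threshold = np.percentile(scores, 95)      # mirror a 5% contamination rate
outliers = np.flatnonzero(scores > threshold)
print(scores.argmax())                     # the planted point gets the top score
```

PyOD's `KNN` does the same scoring with an efficient neighbor search instead of the O(n²) distance matrix above, and sets the threshold from the `contamination` parameter at fit time.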
# Function to count the unique values
def count_stat(vector):
unique, counts = np.unique(vector, return_counts=True)
return dict(zip(unique, counts))
def descriptive_stat_threshold(df, pred_score, threshold):
df = pd.DataFrame(df)
df['Anomaly_Score'] = pred_score
df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')
cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score': 'Count'})
cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100
stat = df.groupby('Group').mean().round(2).reset_index()
stat = cnt.merge(stat, on='Group')
return stat
# Build the KNN model
knn = KNN(contamination=0.05)
knn.fit(X_train)
# Predict anomaly scores for training and test data
y_train_scores = knn.decision_function(X_train)
y_train_pred = knn.predict(X_train)
y_test_scores = knn.decision_function(X_test)
y_test_pred = knn.predict(X_test)
# Print the statistics of the predictions for training and test data
print("Training Data Predictions:", count_stat(y_train_pred))
print("Test Data Predictions:", count_stat(y_test_pred))
# Get descriptive statistics for outliers
threshold = knn.threshold_
# Print the threshold for the defined contamination rate
print("Threshold for Contamination Rate:", knn.threshold_, '\n')
Training Data Predictions: {0: 124620, 1: 5832}
Test Data Predictions: {0: 30977, 1: 1636}
Threshold for Contamination Rate: 0.20077006129635047
knn.get_params()
{'algorithm': 'auto',
'contamination': 0.05,
'leaf_size': 30,
'method': 'largest',
'metric': 'minkowski',
'metric_params': None,
'n_jobs': 1,
'n_neighbors': 5,
'p': 2,
'radius': 1.0}
stats_train = descriptive_stat_threshold(pd.DataFrame(X_train, columns=ratio_columns), y_train_scores, threshold)
stats_test = descriptive_stat_threshold(pd.DataFrame(X_test, columns=ratio_columns), y_test_scores, threshold)
# Descriptive statistics for train data
print("Training Data Statistics:")
stats_train
Training Data Statistics:
| | Group | Count | Count % | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Anomaly_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 124620 | 95.52939 | 0.98 | 0.99 | 1.00 | 1.00 | 0.98 | 0.99 | 1.00 | 0.96 | 0.99 | 1.00 | 0.05 |
| 1 | Outlier | 5832 | 4.47061 | 1.34 | 1.24 | 1.07 | 1.01 | 1.33 | 1.11 | 1.02 | 1.92 | 1.30 | 1.04 | 0.33 |
# Descriptive statistics for test data
print("Test Data Statistics:")
stats_test
Test Data Statistics:
| | Group | Count | Count % | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Anomaly_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 30977 | 94.983595 | 0.98 | 0.99 | 1.00 | 1.00 | 0.98 | 0.99 | 1.00 | 0.95 | 0.99 | 1.00 | 0.05 |
| 1 | Outlier | 1636 | 5.016405 | 1.32 | 1.22 | 1.07 | 1.01 | 1.31 | 1.10 | 1.02 | 1.89 | 1.29 | 1.04 | 0.33 |
4.1.1 KNN Outlier List ¶
# Add a 'Prediction' column to original dataset
final_df['Prediction'] = 0 # Initialize with 0
# Mark outliers in training data
final_df.loc[X_train.index, 'Prediction'] = y_train_pred
# Mark outliers in test data
final_df.loc[X_test.index, 'Prediction'] = y_test_pred
final_df.describe()
| | Provider_Zipcode | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payments | Average_OOP | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 |
| mean | 47938.121908 | 42.776304 | 36133.954224 | 9707.473804 | 8494.490964 | 1212.982840 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999994 | 0.999994 | 0.045798 |
| std | 27854.323080 | 51.104042 | 35065.365931 | 7664.642598 | 7309.467261 | 1148.097584 | 0.281375 | 0.215091 | 0.101956 | 0.038675 | 0.195274 | 0.092549 | 0.035360 | 0.586527 | 0.229179 | 0.086610 | 0.209047 |
| min | 1040.000000 | 11.000000 | 2459.400000 | 2673.000000 | 1148.900000 | 0.000000 | 0.337408 | 0.339831 | 0.261737 | 0.261737 | 0.384805 | 0.238195 | 0.303604 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 27261.000000 | 17.000000 | 15947.160000 | 5234.500000 | 4192.350000 | 776.310000 | 0.820838 | 0.868395 | 1.000000 | 1.000000 | 0.880824 | 1.000000 | 1.000000 | 0.721250 | 1.000000 | 1.000000 | 0.000000 |
| 50% | 44309.000000 | 27.000000 | 25245.820000 | 7214.100000 | 6158.460000 | 935.720000 | 0.917180 | 0.954250 | 1.000000 | 1.000000 | 0.956830 | 1.000000 | 1.000000 | 0.882159 | 1.000000 | 1.000000 | 0.000000 |
| 75% | 72901.000000 | 49.000000 | 43232.590000 | 11286.400000 | 10056.880000 | 1221.410000 | 1.090344 | 1.071483 | 1.000000 | 1.000000 | 1.064688 | 1.000000 | 1.000000 | 1.101927 | 1.000000 | 1.000000 | 0.000000 |
| max | 99835.000000 | 3383.000000 | 929118.900000 | 156158.180000 | 154620.810000 | 75998.660000 | 7.672359 | 5.174415 | 3.846861 | 1.782259 | 7.857934 | 3.753099 | 1.897813 | 38.192611 | 10.028956 | 3.545154 | 1.000000 |
# Filter out outliers
outliers = final_df[final_df['Prediction'] == 1]
# Display the list of outliers
outliers_list = outliers[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction']]
print("List of Outliers:")
outliers_list
List of Outliers:
| DRG | Provider_Id | Provider_Name | Provider_State | Provider_City | Provider_Zipcode | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 30100 | CARONDELET HEART AND VASCULAR INSTITUTE | AZ | TUCSON | 85704 | 0.646680 | 0.607636 | 0.731709 | 1.000000 | 0.821832 | 0.942372 | 1.000000 | 1.753963 | 1.665115 | 1.000000 | 1 |
| 67 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 50017 | MERCY GENERAL HOSPITAL | CA | SACRAMENTO | 95819 | 1.246383 | 1.006650 | 0.844865 | 1.000000 | 1.295101 | 1.042120 | 1.000000 | 2.610058 | 1.767791 | 1.000000 | 1 |
| 88 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 50180 | JOHN MUIR MEDICAL CENTER - WALNUT CREEK CAMPUS | CA | WALNUT CREEK | 94598 | 1.219708 | 0.985106 | 1.000000 | 1.000000 | 1.473051 | 1.000000 | 1.000000 | 3.697437 | 1.000000 | 1.000000 | 1 |
| 104 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 50334 | SALINAS VALLEY MEMORIAL HOSPITAL | CA | SALINAS | 93901 | 1.269125 | 1.025018 | 1.000000 | 1.000000 | 1.996299 | 1.000000 | 1.000000 | 6.424064 | 1.000000 | 1.000000 | 1 |
| 123 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 50599 | UNIVERSITY OF CALIFORNIA DAVIS MEDICAL CENTER | CA | SACRAMENTO | 95817 | 1.756405 | 1.418573 | 1.190586 | 1.000000 | 1.372732 | 1.104587 | 1.000000 | 1.163755 | 0.788210 | 1.000000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 162869 | 948 - SIGNS & SYMPTOMS W/O MCC | 450508 | MEMORIAL HOSPITAL | TX | NACOGDOCHES | 75961 | 0.809226 | 0.846520 | 0.974436 | 0.974436 | 0.931781 | 1.091734 | 1.091734 | 1.289130 | 1.632630 | 1.632630 | 1 |
| 162934 | 948 - SIGNS & SYMPTOMS W/O MCC | 490032 | MEDICAL COLLEGE OF VIRGINIA HOSPITALS | VA | RICHMOND | 23298 | 1.510141 | 1.751285 | 1.538383 | 1.000000 | 2.039643 | 1.661285 | 1.000000 | 3.093812 | 1.990335 | 1.000000 | 1 |
| 162972 | 948 - SIGNS & SYMPTOMS W/O MCC | 500008 | UNIVERSITY OF WASHINGTON MEDICAL CTR | WA | SEATTLE | 98195 | 1.747121 | 1.759734 | 1.445275 | 1.000000 | 1.706477 | 1.452658 | 1.000000 | 1.541961 | 1.479298 | 1.000000 | 1 |
| 163030 | 948 - SIGNS & SYMPTOMS W/O MCC | 520059 | AURORA MEMORIAL HSPTL BURLINGTON | WI | BURLINGTON | 53105 | 0.676813 | 0.737801 | 1.000000 | 1.000000 | 1.512762 | 1.000000 | 1.000000 | 4.173365 | 1.000000 | 1.000000 | 1 |
| 163050 | 948 - SIGNS & SYMPTOMS W/O MCC | 520138 | AURORA ST LUKES MEDICAL CENTER | WI | MILWAUKEE | 53215 | 1.073500 | 1.170233 | 0.937221 | 1.045942 | 1.234657 | 1.046721 | 1.100769 | 1.455838 | 1.544807 | 1.286938 | 1 |
7468 rows × 17 columns
print(f' The Outlier list contains {len(outliers_list.groupby("Provider_Name").count())} providers who performed {len(outliers_list.groupby("DRG").count())} procedures across {len(outliers_list.groupby("Provider_State").count())} states', '\n')
The Outlier list contains 1107 providers who performed 100 procedures across 51 states
4.1.2 Top 10 Outliers by DRG ¶
outliers_list.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
| count | |
|---|---|
| DRG | |
| 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 186 |
| 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT 96+ HOURS | 184 |
| 314 - OTHER CIRCULATORY SYSTEM DIAGNOSES W MCC | 171 |
| 208 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT <96 HOURS | 165 |
| 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | 160 |
| 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC | 156 |
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | 149 |
| 252 - OTHER VASCULAR PROCEDURES W MCC | 134 |
| 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 133 |
| 682 - RENAL FAILURE W MCC | 132 |
4.1.3 Outliers by State ¶
outliers_list.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)
| count | |
|---|---|
| Provider_State | |
| CA | 1069 |
| NY | 937 |
| TX | 725 |
| IL | 544 |
| FL | 411 |
| MD | 377 |
| PA | 355 |
| MO | 168 |
| MN | 165 |
| NM | 145 |
| MI | 143 |
| LA | 137 |
| AZ | 136 |
| WI | 135 |
| IN | 129 |
| OH | 126 |
| OK | 122 |
| GA | 119 |
| MA | 110 |
| AL | 109 |
| VA | 106 |
| WA | 103 |
| NJ | 98 |
| TN | 93 |
| AK | 78 |
| MS | 74 |
| CO | 64 |
| CT | 63 |
| KY | 63 |
| SC | 58 |
| HI | 57 |
| AR | 57 |
| DC | 56 |
| KS | 51 |
| IA | 44 |
| NC | 41 |
| OR | 38 |
| WV | 35 |
| NV | 34 |
| NE | 26 |
| SD | 20 |
| UT | 14 |
| RI | 13 |
| WY | 6 |
| NH | 5 |
| ID | 4 |
| ME | 4 |
| MT | 1 |
| DE | 0 |
| VT | 0 |
| ND | 0 |
Key Takeaways
- A total of 7,468 instances (~5% of total data) were identified as outliers
- Both training and test data show a clear majority of instances classified as normal, with a smaller percentage identified as outliers.
- The average values of ratio columns (related to Medicare payments, total payments, and out-of-pocket expenses) are higher in outlier groups across both datasets, indicating potential areas of abnormal behavior or higher variability.
- The anomaly detection model appears consistent between training and test data, with similar patterns in group distribution and anomaly scores.
- DRG 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC and the state of CA appear to show the most potentially fraudulent activity.
4.2 Combination by Average ¶
This approach provides a consolidated view of anomaly scores across multiple KNN models.
# Averaging multiple KNN models
# Standardize data
X_train_norm, X_test_norm = standardizer(X_train, X_test)
# Test a range of k-neighbors from 10 to 200
k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110,
120, 130, 140, 150, 160, 170, 180, 190, 200]
n_clf = len(k_list)
# Prepare data frames to store the model results
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])
# Modeling
for i, k in enumerate(k_list):
clf = KNN(n_neighbors=k, method='largest')
clf.fit(X_train_norm)
train_scores[:, i] = clf.decision_scores_
test_scores[:, i] = clf.decision_function(X_test_norm)
# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)
# Combination by average for training and test data
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)
# Calculate new threshold for Combination by average
def set_threshold_by_contamination_rate(anomaly_scores, desired_contamination_rate):
threshold = np.percentile(anomaly_scores, 100 * (1 - desired_contamination_rate))
return threshold
# Example: Set the contamination rate to 5% (0.05)
desired_contamination_rate = 0.05
comb_threshold = set_threshold_by_contamination_rate(y_train_by_average, desired_contamination_rate)
# Print the new threshold
print("New Threshold for Contamination Rate:", comb_threshold)
New Threshold for Contamination Rate: 1.6723343404873459
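As a sanity check, the percentile rule above can be verified on synthetic scores. This is a hedged sketch on made-up data (not the notebook's actual anomaly scores): by construction, roughly the top 5% of scores land at or above the threshold.

```python
import numpy as np

def set_threshold_by_contamination_rate(anomaly_scores, desired_contamination_rate):
    # Flag the top `desired_contamination_rate` fraction of scores as outliers
    return np.percentile(anomaly_scores, 100 * (1 - desired_contamination_rate))

rng = np.random.default_rng(0)
scores = rng.normal(size=10_000)                 # stand-in anomaly scores
threshold = set_threshold_by_contamination_rate(scores, 0.05)
flagged_fraction = (scores >= threshold).mean()  # should be close to 0.05
```

For standard-normal scores, the resulting threshold sits near the 95th-percentile value (~1.64), and about 5% of points are flagged.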
stats_train_comb = descriptive_stat_threshold(pd.DataFrame(X_train, columns=ratio_columns), y_train_by_average, comb_threshold)
stats_test_comb = descriptive_stat_threshold(pd.DataFrame(X_test, columns=ratio_columns), y_test_by_average, comb_threshold)
# Descriptive statistics for train data
print("Training Data Statistics:")
stats_train_comb
Training Data Statistics:
| Group | Count | Count % | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 123929 | 94.999693 | 0.98 | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.97 | 0.99 | 1.00 | -0.16 |
| 1 | Outlier | 6523 | 5.000307 | 1.32 | 1.23 | 1.07 | 1.01 | 1.28 | 1.08 | 1.01 | 1.63 | 1.19 | 1.01 | 3.09 |
# Descriptive statistics for test data
print("Test Data Statistics:")
stats_test_comb
Test Data Statistics:
| Group | Count | Count % | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 30985 | 95.008126 | 0.98 | 0.99 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 0.97 | 0.99 | 1.00 | -0.17 |
| 1 | Outlier | 1628 | 4.991874 | 1.32 | 1.23 | 1.06 | 1.01 | 1.28 | 1.08 | 1.01 | 1.63 | 1.21 | 1.02 | 3.08 |
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
fig = go.Figure()
# Split the scores into two parts: below the threshold and above/equal to the threshold
below_threshold = scores[scores < threshold]
above_threshold = scores[scores >= threshold]
# Add histogram traces for the two parts
fig.add_trace(go.Histogram(x=below_threshold, name='Below Threshold', marker=dict(color='#09b18e')))
fig.add_trace(go.Histogram(x=above_threshold, name='Above Threshold', marker=dict(color='#8f63e2')))
# Add vertical line annotating the threshold
fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
annotation_position="top right", annotation=dict(font=dict(color="#0d5087", size=12)))
# Update layout
fig.update_layout(
title=title,
xaxis_title="Outlier Score",
yaxis_title="Frequency",
showlegend=True,
height=450,
width=800,
bargap=0.01,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)'
)
fig.update_xaxes(range=[-1, 5])
# Show the plot
fig.show()
# Plotting the combination by average for training data
plot_histogram_with_threshold(np.array(y_train_by_average), "Training Data - Combination by Average", comb_threshold)
# Plotting the combination by average for test data
plot_histogram_with_threshold(np.array(y_test_by_average), "Test Data - Combination by Average", comb_threshold)
4.2.1 Combination by AVG Outlier List ¶
# Add a column 'Prediction_Comb_Avg' to final_df and initialize with 0 (not outlier)
final_df['Prediction_Comb_Avg'] = 0
# Mark outliers in training data
final_df.loc[X_train.index[y_train_by_average >= comb_threshold], 'Prediction_Comb_Avg'] = 1
# Mark outliers in test data
final_df.loc[X_test.index[y_test_by_average >= comb_threshold], 'Prediction_Comb_Avg'] = 1
final_df.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction | Prediction_Comb_Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | ... | 1.038763 | 1.0 | 1.005855 | 1.037810 | 1.0 | 0.885921 | 1.033356 | 1.0 | 0 | 0 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | ... | 1.000000 | 1.0 | 1.007653 | 1.000000 | 1.0 | 0.708782 | 1.000000 | 1.0 | 0 | 0 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | ... | 1.000000 | 1.0 | 0.946260 | 1.000000 | 1.0 | 0.857643 | 1.000000 | 1.0 | 0 | 0 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | ... | 0.899200 | 1.0 | 0.943232 | 0.915208 | 1.0 | 1.126205 | 0.970586 | 1.0 | 0 | 0 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | ... | 1.000000 | 1.0 | 0.985152 | 1.000000 | 1.0 | 0.705312 | 1.000000 | 1.0 | 0 | 0 |
5 rows × 25 columns
# Filter out outliers
outliers_comb = final_df[final_df['Prediction_Comb_Avg'] == 1]
# Display the list of outliers
outliers_list_comb = outliers_comb[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_Comb_Avg']]
print("List of Outliers:")
outliers_list_comb.head(5)
List of Outliers:
| DRG | Provider_Id | Provider_Name | Provider_State | Provider_City | Provider_Zipcode | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction_Comb_Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 50334 | SALINAS VALLEY MEMORIAL HOSPITAL | CA | SALINAS | 93901 | 1.269125 | 1.025018 | 1.000000 | 1.00000 | 1.996299 | 1.000000 | 1.000000 | 6.424064 | 1.000000 | 1.0000 | 1 |
| 245 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 110010 | EMORY UNIVERSITY HOSPITAL | GA | ATLANTA | 30322 | 1.090453 | 1.179870 | 1.216306 | 1.00000 | 1.514428 | 1.408846 | 1.000000 | 2.853551 | 1.908965 | 1.0000 | 1 |
| 280 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 140008 | LOYOLA GOTTLIEB MEMORIAL HOSPITAL | IL | MELROSE PARK | 60160 | 0.792296 | 0.817234 | 1.000000 | 1.00000 | 2.497097 | 1.000000 | 1.000000 | 7.042845 | 1.000000 | 1.0000 | 1 |
| 294 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 140080 | RHC ST FRANCIS HOSPITAL | IL | EVANSTON | 60202 | 1.248071 | 1.287354 | 1.103888 | 1.00000 | 2.351565 | 1.390816 | 1.000000 | 5.231343 | 1.681938 | 1.0000 | 1 |
| 356 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 150109 | FRANCISCAN ST ELIZABETH HEALTH - LAFAYETTE EAST | IN | LAFAYETTE | 47905 | 0.925007 | 1.033613 | 1.170310 | 1.17031 | 0.984917 | 0.838043 | 0.838043 | 0.818963 | 0.377300 | 0.3773 | 1 |
print(f' The Combination by AVG Outlier list contains {len(outliers_list_comb.groupby("Provider_Name").count())} providers who performed {len(outliers_list_comb.groupby("DRG").count())} procedures across {len(outliers_list_comb.groupby("Provider_State").count())} states', '\n')
The Combination by AVG Outlier list contains 894 providers who performed 100 procedures across 51 states
4.2.2 Top 10 Outliers by DRG ¶
outliers_list_comb.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
| count | |
|---|---|
| DRG | |
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | 171 |
| 291 - HEART FAILURE & SHOCK W MCC | 165 |
| 208 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT <96 HOURS | 161 |
| 853 - INFECTIOUS & PARASITIC DISEASES W O.R. PROCEDURE W MCC | 159 |
| 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 150 |
| 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | 149 |
| 682 - RENAL FAILURE W MCC | 144 |
| 314 - OTHER CIRCULATORY SYSTEM DIAGNOSES W MCC | 143 |
| 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC | 139 |
| 292 - HEART FAILURE & SHOCK W CC | 129 |
4.2.3 Outliers by State ¶
outliers_list_comb.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)
| count | |
|---|---|
| Provider_State | |
| NY | 1190 |
| CA | 1092 |
| TX | 969 |
| IL | 605 |
| FL | 449 |
| PA | 416 |
| MD | 336 |
| MO | 293 |
| OK | 202 |
| LA | 201 |
| NM | 201 |
| MS | 156 |
| WI | 146 |
| MI | 134 |
| MN | 133 |
| GA | 125 |
| IN | 121 |
| OH | 114 |
| AR | 113 |
| KY | 104 |
| AL | 86 |
| TN | 81 |
| VA | 80 |
| WA | 80 |
| AZ | 79 |
| NJ | 75 |
| CO | 55 |
| WV | 52 |
| HI | 50 |
| KS | 50 |
| DC | 49 |
| MA | 43 |
| AK | 39 |
| OR | 36 |
| IA | 36 |
| NC | 31 |
| SC | 29 |
| SD | 22 |
| NV | 19 |
| ME | 15 |
| NE | 14 |
| NH | 10 |
| UT | 8 |
| CT | 7 |
| ID | 2 |
| MT | 1 |
| RI | 1 |
| WY | 1 |
| VT | 0 |
| DE | 0 |
| ND | 0 |
Key Takeaways
- A total of 8,151 instances (~5% of total data) were identified as outliers.
- The combination by average method provides a consolidated view of anomaly scores across multiple KNN models.
- Both training and test datasets show a majority of instances classified as normal, with a smaller percentage identified as outliers.
- The average anomaly scores for the outlier group are consistently higher than those for the normal group, indicating effective identification of anomalies by the combined model.
- DRG 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC and the state of NY appear to show the most potentially fraudulent activity.
Section 5: Principal Component Analysis (PCA) ¶
PCA reduces the number of features while retaining most of the important information. Fewer dimensions lower the computational cost and help mitigate overfitting in our models.
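For intuition on why PCA can score outliers: points that break the dominant correlation structure reconstruct poorly from the leading components. Below is a minimal sketch on synthetic data using scikit-learn's PCA and reconstruction error — an illustration of the idea only, since PyOD's PCA detector computes a weighted distance in component space rather than exactly this.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two strongly correlated features: col1 is mostly a rescaled copy of col0
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.9], [0.0, 0.1]])
X[:5] += np.array([0.0, 4.0])   # inject 5 anomalies off the main axis

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=1).fit(X_std)              # keep the dominant direction
X_recon = pca.inverse_transform(pca.transform(X_std))
scores = np.square(X_std - X_recon).sum(axis=1)   # reconstruction error per row

# The injected anomalies should receive the highest scores
top5 = np.argsort(scores)[-5:]
```

Points lying on the main correlation axis reconstruct almost perfectly from one component; the five injected rows deviate from that axis and get reconstruction errors orders of magnitude larger.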
pca = PCA(contamination=0.05)
pca.fit(X_train)
# Get the binary prediction labels (0: inlier, 1: outlier) and raw outlier scores;
# for the training data these match the fitted .labels_ / .decision_scores_ attributes
y_train_pred_pca = pca.predict(X_train)
y_train_scores_pca = pca.decision_function(X_train)
y_test_scores_pca = pca.decision_function(X_test)
y_test_pred_pca = pca.predict(X_test)
pca.get_params()
{'contamination': 0.05,
'copy': True,
'iterated_power': 'auto',
'n_components': None,
'n_selected_components': None,
'random_state': None,
'standardization': True,
'svd_solver': 'auto',
'tol': 0.0,
'weighted': True,
'whiten': False}
[pca.explained_variance_,
pca.explained_variance_ratio_]
[array([3.85181529, 1.9101038 , 1.66358007, 0.91376819, 0.85387575,
0.48865034, 0.233276 , 0.04869219, 0.02381848, 0.01249654]),
array([0.38517858, 0.19100892, 0.16635673, 0.09137612, 0.08538692,
0.04886466, 0.02332742, 0.00486918, 0.00238183, 0.00124964])]
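The explained-variance ratios above decay quickly. A cumulative sum shows how many components are needed to retain a given share of the variance; the values below are copied from the printout above.

```python
import numpy as np

# Explained-variance ratios from the fitted PCA model above
evr = np.array([0.38517858, 0.19100892, 0.16635673, 0.09137612, 0.08538692,
                0.04886466, 0.02332742, 0.00486918, 0.00238183, 0.00124964])
cumulative = np.cumsum(evr)
# Smallest number of components covering at least 90% of the variance
n_components_90 = int(np.argmax(cumulative >= 0.90) + 1)
```

The first five components already capture about 92% of the total variance, so most of the signal lives in half the original dimensions.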
pca_threshold = pca.threshold_
print("The threshold for the defined comtanimation rate:" , pca.threshold_)
The threshold for the defined comtanimation rate: 10158.968465695625
stats_pca_train=descriptive_stat_threshold(pd.DataFrame(X_train, columns=ratio_columns), y_train_scores_pca, pca_threshold)
stats_pca_test=descriptive_stat_threshold(pd.DataFrame(X_test, columns=ratio_columns), y_test_scores_pca, pca_threshold)
# Descriptive statistics for train data
print("Training Data Statistics:")
stats_pca_train
Training Data Statistics:
| Group | Count | Count % | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 123929 | 94.999693 | 0.98 | 0.98 | 0.99 | 1.00 | 0.98 | 0.99 | 1.00 | 0.97 | 0.99 | 1.0 | 3362.86 |
| 1 | Outlier | 6523 | 5.000307 | 1.40 | 1.33 | 1.13 | 1.01 | 1.36 | 1.14 | 1.01 | 1.60 | 1.16 | 1.0 | 14465.32 |
# Descriptive statistics for test data
print("Test Data Statistics:")
stats_pca_test
Test Data Statistics:
| Group | Count | Count % | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 31028 | 95.139975 | 0.98 | 0.98 | 0.99 | 1.00 | 0.98 | 0.99 | 1.00 | 0.97 | 0.99 | 1.00 | 3364.83 |
| 1 | Outlier | 1585 | 4.860025 | 1.39 | 1.32 | 1.12 | 1.01 | 1.35 | 1.13 | 1.01 | 1.62 | 1.18 | 1.01 | 14402.64 |
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
fig = go.Figure()
# Split the scores into two parts: below the threshold and above/equal to the threshold
below_threshold = scores[scores < threshold]
above_threshold = scores[scores >= threshold]
# Add histogram traces for the two parts
fig.add_trace(go.Histogram(x=below_threshold, name='Below Threshold', marker=dict(color='#09b18e')))
fig.add_trace(go.Histogram(x=above_threshold, name='Above Threshold', marker=dict(color='#8f63e2')))
# Add vertical line annotating the threshold
fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
annotation_position="top right", annotation=dict(font=dict(color="#0d5087", size=12)))
# Update layout
fig.update_layout(
title=title,
xaxis_title="Outlier Score",
yaxis_title="Frequency",
showlegend=True,
height=500,
width=800,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)'
)
# Show the plot
fig.show()
# Plotting the combination by average for training data
plot_histogram_with_threshold(y_train_scores_pca, "PCA - Training Data", pca_threshold)
# Plotting the combination by average for test data
plot_histogram_with_threshold(y_test_scores_pca, "PCA - Test Data", pca_threshold)
5.1.1 PCA Outlier List¶
# Add a column 'Prediction_PCA' to final_df and initialize with 0 (not outlier)
final_df['Prediction_PCA'] = 0
# Mark outliers in training data
final_df.loc[X_train.index, 'Prediction_PCA'] = y_train_pred_pca
# Mark outliers in test data
final_df.loc[X_test.index, 'Prediction_PCA'] = y_test_pred_pca
final_df.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction | Prediction_Comb_Avg | Prediction_PCA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | ... | 1.0 | 1.005855 | 1.037810 | 1.0 | 0.885921 | 1.033356 | 1.0 | 0 | 0 | 0 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | ... | 1.0 | 1.007653 | 1.000000 | 1.0 | 0.708782 | 1.000000 | 1.0 | 0 | 0 | 0 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | ... | 1.0 | 0.946260 | 1.000000 | 1.0 | 0.857643 | 1.000000 | 1.0 | 0 | 0 | 0 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | ... | 1.0 | 0.943232 | 0.915208 | 1.0 | 1.126205 | 0.970586 | 1.0 | 0 | 0 | 0 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | ... | 1.0 | 0.985152 | 1.000000 | 1.0 | 0.705312 | 1.000000 | 1.0 | 0 | 0 | 0 |
5 rows × 26 columns
# Filter out outliers
outliers_pca = final_df[final_df['Prediction_PCA'] == 1]
# Display the list of outliers
outliers_list_pca = outliers_pca[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_PCA']]
print("List of Outliers:")
outliers_list_pca.head(5)
List of Outliers:
| DRG | Provider_Id | Provider_Name | Provider_State | Provider_City | Provider_Zipcode | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction_PCA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 50334 | SALINAS VALLEY MEMORIAL HOSPITAL | CA | SALINAS | 93901 | 1.269125 | 1.025018 | 1.000000 | 1.0 | 1.996299 | 1.000000 | 1.0 | 6.424064 | 1.000000 | 1.0 | 1 |
| 245 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 110010 | EMORY UNIVERSITY HOSPITAL | GA | ATLANTA | 30322 | 1.090453 | 1.179870 | 1.216306 | 1.0 | 1.514428 | 1.408846 | 1.0 | 2.853551 | 1.908965 | 1.0 | 1 |
| 280 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 140008 | LOYOLA GOTTLIEB MEMORIAL HOSPITAL | IL | MELROSE PARK | 60160 | 0.792296 | 0.817234 | 1.000000 | 1.0 | 2.497097 | 1.000000 | 1.0 | 7.042845 | 1.000000 | 1.0 | 1 |
| 294 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 140080 | RHC ST FRANCIS HOSPITAL | IL | EVANSTON | 60202 | 1.248071 | 1.287354 | 1.103888 | 1.0 | 2.351565 | 1.390816 | 1.0 | 5.231343 | 1.681938 | 1.0 | 1 |
| 347 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 150056 | INDIANA UNIVERSITY HEALTH | IN | INDIANAPOLIS | 46206 | 1.269749 | 1.418832 | 1.359352 | 1.0 | 1.518408 | 1.450774 | 1.0 | 1.857759 | 1.758615 | 1.0 | 1 |
print(f' The PCA Outlier list contains {len(outliers_list_pca.groupby("Provider_Name").count())} providers who performed {len(outliers_list_pca.groupby("DRG").count())} procedures across {len(outliers_list_pca.groupby("Provider_State").count())} states', '\n')
The PCA Outlier list contains 895 providers who performed 100 procedures across 51 states
5.1.2 Top 10 Outliers by DRG¶
outliers_list_pca.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
| count | |
|---|---|
| DRG | |
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | 158 |
| 291 - HEART FAILURE & SHOCK W MCC | 152 |
| 208 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT <96 HOURS | 147 |
| 853 - INFECTIOUS & PARASITIC DISEASES W O.R. PROCEDURE W MCC | 143 |
| 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 142 |
| 682 - RENAL FAILURE W MCC | 135 |
| 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC | 134 |
| 292 - HEART FAILURE & SHOCK W CC | 132 |
| 314 - OTHER CIRCULATORY SYSTEM DIAGNOSES W MCC | 132 |
| 194 - SIMPLE PNEUMONIA & PLEURISY W CC | 131 |
5.1.3 Outliers by State¶
outliers_list_pca.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)
| count | |
|---|---|
| Provider_State | |
| NY | 1073 |
| CA | 1054 |
| TX | 923 |
| IL | 501 |
| FL | 429 |
| PA | 378 |
| MO | 317 |
| OK | 287 |
| MD | 258 |
| NM | 181 |
| MS | 178 |
| IN | 175 |
| VA | 163 |
| LA | 157 |
| OH | 146 |
| MN | 141 |
| GA | 134 |
| AR | 130 |
| KY | 126 |
| AZ | 124 |
| TN | 108 |
| SC | 101 |
| WI | 97 |
| NJ | 89 |
| WA | 88 |
| MI | 82 |
| AL | 78 |
| WV | 77 |
| CO | 70 |
| DC | 56 |
| OR | 49 |
| MA | 45 |
| IA | 42 |
| HI | 38 |
| NC | 37 |
| AK | 32 |
| NV | 24 |
| NE | 23 |
| KS | 23 |
| SD | 22 |
| ME | 20 |
| UT | 13 |
| NH | 9 |
| CT | 7 |
| ID | 2 |
| RI | 1 |
| ND | 0 |
| MT | 0 |
| VT | 0 |
| DE | 0 |
| WY | 0 |
Key Takeaways
- A total of 8,180 instances (~5% of total data) were identified as outliers.
- The average anomaly score for outliers (14,465.32) is much higher compared to the normal group (3,362.86), indicating that outliers exhibit more extreme deviations in their feature values.
- DRG 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC and the state of NY appear to show the most potentially fraudulent activity.
Section 6: Model Comparison¶
########
# HBOS #
########
from pyod.models.hbos import HBOS
n_bins = 50
hbos = HBOS(n_bins=n_bins,contamination=0.05)
hbos.fit(X_train)
y_train_hbos_scores = hbos.decision_function(X_train)
y_test_hbos_scores = hbos.decision_function(X_test)
########
# ECOD #
########
from pyod.models.ecod import ECOD
clf_name = 'ECOD'
ecod = ECOD(contamination=0.05)
ecod.fit(X_train)
y_test_ecod_scores = ecod.decision_function(X_test)
########
# PCA #
########
from pyod.models.pca import PCA
pca = PCA(contamination=0.05)
pca.fit(X_train)
y_test_pca_scores = pca.decision_function(X_test)
########
# KNN ##
########
knn = KNN(contamination=0.05)
knn.fit(X_train)
y_test_knn_scores = knn.decision_function(X_test)
# Thresholds
[ecod.threshold_, hbos.threshold_, pca.threshold_, knn.threshold_]
[26.6950577459844, 3.340918013562263, 10158.968465695625, 0.20077006129635047]
# Collect the HBOS, ECOD, PCA and KNN test scores in one frame
Actual_preds = pd.DataFrame({'HBO_Score': y_test_hbos_scores,
'ECOD_Score': y_test_ecod_scores,
'PCA_Score': y_test_pca_scores,
'KNN_Score': y_test_knn_scores})
Actual_preds['HBOS_pred'] = np.where(Actual_preds['HBO_Score']>hbos.threshold_,1,0)
Actual_preds['ECOD_pred'] = np.where(Actual_preds['ECOD_Score']>ecod.threshold_,1,0)
Actual_preds['PCA_pred'] = np.where(Actual_preds['PCA_Score']>pca.threshold_,1,0)
Actual_preds['KNN_pred'] = np.where(Actual_preds['KNN_Score']>knn.threshold_,1,0)
Actual_preds.head()
| HBO_Score | ECOD_Score | PCA_Score | KNN_Score | HBOS_pred | ECOD_pred | PCA_pred | KNN_pred | |
|---|---|---|---|---|---|---|---|---|
| 0 | -26.765132 | 7.122095 | 2450.436977 | 0.014515 | 0 | 0 | 0 | 0 |
| 1 | -26.765132 | 8.352012 | 2660.476899 | 0.026798 | 0 | 0 | 0 | 0 |
| 2 | -24.073842 | 6.490236 | 2021.072391 | 0.018293 | 0 | 0 | 0 | 0 |
| 3 | -27.563327 | 5.262983 | 2137.255685 | 0.011209 | 0 | 0 | 0 | 0 |
| 4 | -25.121238 | 5.724757 | 2047.587991 | 0.025255 | 0 | 0 | 0 | 0 |
HBOS vs PCA¶
pd.crosstab(Actual_preds['HBOS_pred'],Actual_preds['PCA_pred'])
| PCA_pred | 0 | 1 |
|---|---|---|
| HBOS_pred | ||
| 0 | 30364 | 685 |
| 1 | 664 | 900 |
# Compute confusion matrix
cm = pd.crosstab(Actual_preds['HBOS_pred'],Actual_preds['PCA_pred'])
# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm,
x=['Predicted 0', 'Predicted 1'],
y=['True 0', 'True 1'],
showscale=False,
colorscale=[
[0.0, "#09b18e"],
[0.0015, "#0d5087"],
[1.0, "#ffffff"],
],
xgap=20,
ygap=20,
text=cm,
texttemplate="%{text}")
# Define layout
layout = go.Layout(
title=dict(text="Confusion Matrix between HBOS and PCA", x=0.5, y=0.9, xanchor='center', yanchor='top'),
xaxis=dict(title='PCA Pred', showticklabels=True),
yaxis=dict(title='HBOS Pred', showticklabels=True),
autosize=False,
width=500,
height=500,
plot_bgcolor='rgba(0,0,0,0)',
paper_bgcolor='rgba(0,0,0,0)'
)
# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()
ECOD vs PCA¶
pd.crosstab(Actual_preds['ECOD_pred'],Actual_preds['PCA_pred'])
| PCA_pred | 0 | 1 |
|---|---|---|
| ECOD_pred | ||
| 0 | 30552 | 447 |
| 1 | 476 | 1138 |
# Cross-tabulate ECOD vs PCA flags (model agreement; there is no ground truth)
cm = pd.crosstab(Actual_preds['ECOD_pred'],Actual_preds['PCA_pred'])
# Plot agreement matrix as heatmap
trace2 = go.Heatmap(z=cm,
x=['PCA 0', 'PCA 1'],
y=['ECOD 0', 'ECOD 1'],
showscale=False,
colorscale=[
[0.0, "#09b18e"],
[0.0015, "#0d5087"],
[1.0, "#ffffff"],
],
xgap=20,
ygap=20,
text=cm,
texttemplate="%{text}")
# Define layout
layout = go.Layout(
title=dict(text="Confusion Matrix between ECOD and PCA", x=0.5, y=0.9, xanchor='center', yanchor='top'),
xaxis=dict(title='PCA Pred', showticklabels=True),
yaxis=dict(title='ECOD Pred', showticklabels=True),
autosize=False,
width=500,
height=500,
plot_bgcolor='rgba(0,0,0,0)',
paper_bgcolor='rgba(0,0,0,0)'
)
# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()
KNN vs PCA¶
pd.crosstab(Actual_preds['KNN_pred'],Actual_preds['PCA_pred'])
| PCA_pred | 0 | 1 |
|---|---|---|
| KNN_pred | ||
| 0 | 30393 | 584 |
| 1 | 635 | 1001 |
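Because the agreement rate is dominated by the large normal class, overlap among the flagged points themselves is more telling. A sketch computing the Jaccard overlap of the two outlier sets from the KNN-vs-PCA crosstab counts above:

```python
# Jaccard overlap of the outlier sets from the KNN-vs-PCA crosstab above:
# points flagged by both models, divided by points flagged by either.
both = 1001        # KNN_pred = 1 and PCA_pred = 1
only_knn = 635     # KNN_pred = 1, PCA_pred = 0
only_pca = 584     # KNN_pred = 0, PCA_pred = 1
jaccard = both / (both + only_knn + only_pca)
# → roughly 0.45: the two models share under half of their flagged points
```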
# Cross-tabulate KNN vs PCA flags (model agreement; there is no ground truth)
cm = pd.crosstab(Actual_preds['KNN_pred'],Actual_preds['PCA_pred'])
# Plot agreement matrix as heatmap
trace2 = go.Heatmap(z=cm,
x=['PCA 0', 'PCA 1'],
y=['KNN 0', 'KNN 1'],
showscale=False,
colorscale=[
[0.0, "#09b18e"],
[0.017, "#0d5087"],
[1.0, "#ffffff"],
],
xgap=20,
ygap=20,
text=cm,
texttemplate="%{text}")
# Define layout
layout = go.Layout(
title=dict(text="Confusion Matrix between KNN and PCA", x=0.5, y=0.9, xanchor='center', yanchor='top'),
xaxis=dict(title='PCA Pred', showticklabels=True),
yaxis=dict(title='KNN Pred', showticklabels=True),
autosize=False,
width=500,
height=500,
plot_bgcolor='rgba(0,0,0,0)',
paper_bgcolor='rgba(0,0,0,0)'
)
# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()
Key Takeaways
- KNN and PCA flag similar sets of outliers: they disagree on only 1,219 of 32,613 test rows (~3.7%).
- All models label more than 30k test instances as normal, and each pair of models agrees on roughly 900-1,138 common outliers; the remaining flagged points disagree in both directions, with each model flagging some points the other does not.
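One way to act on these discrepancies is a simple majority vote across the four detectors. A sketch with hypothetical prediction rows (columns ordered HBOS, ECOD, PCA, KNN; in the notebook these would be the `*_pred` columns of `Actual_preds`):

```python
import numpy as np

# Hypothetical binary predictions for 4 test rows from the 4 detectors
# (columns: HBOS, ECOD, PCA, KNN).
preds = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
])
# Flag a row as an outlier only when at least 2 of the 4 models agree.
ensemble = (preds.sum(axis=1) >= 2).astype(int)
```

Requiring agreement from at least two models suppresses flags raised by a single detector, trading a few missed outliers for fewer spurious ones.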